Reading test data from the National Assessment of 
Educational Progress (NAEP) were scaled using the unidimensional item 
response theory model. Data were collected for students aged 9, 13, 
and 17. To determine whether the responses to the reading items were 
consistent with unidimensionality, four different methods were 
applied: (1) principal component analysis of phi and tetrachoric 
correlation matrices; (2) principal component analysis of the image 
correlation matrix, a method based on the work of Guttman; (3) R. D. 
Bock's full-information factor analysis; and (4) P. R. Rosenbaum's 
test of unidimensionality, monotonicity, and conditional 
independence. Balanced incomplete block (BIB) spiralling was used 
with this year's NAEP to assign test items to booklets. This 
permitted the estimation of inter-item correlations, but resulted in 
an unusual pattern of missing data. Results from the analyses 
conducted for each age group were different from the analysis of the 
25 items administered in all three samples. It was concluded that it 
was not unreasonable to regard the reading items as measures of a 
single dimension of reading proficiency. (Author/GDC) 
Abstract 



The reading data from the 1983-1984 NAEP survey were scaled using a 
unidimensional item response theory model. To determine whether the 
responses to the reading items were consistent with unidimensionality, four 
methods were applied: principal component analysis of phi and tetrachoric 
correlation matrices; principal component analysis of the image correlation 
matrix, a method based on the work of Guttman (1953); Bock's full-information 
factor analysis (Bock, Gibbons, and Muraki, 1985); and Rosenbaum's (1984a) 
test of unidimensionality, monotonicity, and conditional independence. 
Results indicated that it was not unreasonable to regard the reading items as 
measures of a single dimension. 
1. The National Assessment of Educational Progress 

The National Assessment of Educational Progress (NAEP) is a congressionally 
mandated survey of the educational achievement of American students that has 
been conducted since 1969. Educational Testing Service assumed responsibility 
for it in 1983. During the 1983-1984 academic year (Year 15 of NAEP), ETS 
collected data on three so-called grages: 9/IV, 13/VIII, and 17/IX. In NAEP 
parlance, a grage is the union of an age, denoted by an Arabic numeral, and a 
grade, denoted by a Roman numeral. 

The subject areas assessed during Year 15 were reading and writing. 
Only the reading items are discussed in the present report. 

1.1 The unidimensionality assumption in item response theory 

In order to determine whether it was reasonable to regard the reading 
items administered in the Year 15 NAEP data collection as measures of a single 
construct, a series of analyses of the dimensionality of the reading data was 
performed. Dimensionality analyses were conducted both within and across the 
three grages, 9/IV, 13/VIII, and 17/IX. It was important to investigate the 
dimensionality issue because the validity of the item response theory (IRT) 
model used to estimate reading proficiency in the 1983-1984 NAEP survey rests 
on the assumption of unidimensionality. It should be noted, however, that 
regardless of whether an IRT model is used, it is ordinarily assumed that items 
on an achievement test can be treated as measures of a single dimension, in 
this case, reading proficiency. Scoring a test by simply summing the item 
scores involves an implicit assumption of unidimensionality; IRT scaling 
formalizes this assumption. 

The reading data were analyzed using the three-parameter logistic 
model (Birnbaum, 1968; Lord, 1980), in which $P_{ij}$, the probability that 
subject i gets item j correct, can be expressed as follows: 

$$P_{ij} = P(x_{ij} = 1 \mid \theta_i) = c_j + \frac{1 - c_j}{1 + \exp[-1.7\,a_j(\theta_i - b_j)]} \tag{1}$$

where $\theta_i$ is the proficiency parameter for person i, $a_j$ is the item 
discrimination parameter, $b_j$ is the item difficulty, and $c_j$ can be 
interpreted as the probability that a person with very low ability gets item j 
correct. (Model parameters were estimated using BILOG [Mislevy and Bock, 
1982]; details are provided in a separate report on scaling.) In applying a 
model of this kind, it is assumed that the only examinee characteristic that 
affects item response is a single latent variable, $\theta$. 

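
As an illustration only (it is not part of the original analyses), equation 1 can be evaluated directly. The following minimal Python sketch assumes a hypothetical item with a = 1.0, b = 0.0, and c = 0.2; the function name `p_correct` and these parameter values are assumptions made for the example.

```python
import math

D = 1.7  # scaling constant that appears in equation 1


def p_correct(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item parameters, chosen only for illustration.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_correct(theta, a=1.0, b=0.0, c=0.2), 3))
```

When the proficiency equals the item difficulty, the probability lies halfway between c and 1 (0.6 for this item); for very low proficiency it approaches the lower asymptote c.
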
1.2 Robustness of IRT Estimation Procedures 

In practice, the assumption of unidimensionality, required for the 
application of conventional IRT models, will always be violated to some degree. 
In order to make a more objective determination as to what constitutes an 
important departure from unidimensionality, we need to know more about the 
robustness of the IRT estimation procedures to violations of the 
unidimensionality assumption. Unfortunately, little work has been done in this 
area. Reckase (1979) and Drasgow and Parsons (1983) investigated the results of 
estimating the three-parameter logistic model, using LOGIST (M. S. Wingersky, 
1983), under violations of the unidimensionality assumption. (The 
one-parameter and two-parameter logistic models were also examined by Reckase, 
1979, and Drasgow and Parsons, 1983, respectively.) Reckase's study was based 
on five actual data sets and five data sets constructed to have specific 
factor structures. He concluded that LOGIST estimates "the first principal 
component when it is large relative to other factors ... good ability 
estimates can be obtained even when the first factor accounts for less 
than 10 percent of the test variance, although item calibration results will 
be unstable. For acceptable calibration, the first factor should account for 
at least 20 percent of the test variance" (p. 228). Drasgow and Parsons 
(1983) made use of a hierarchical model with a general latent trait as well as 
five group factors to simulate various kinds of latent structures. One of 
their conclusions was that, in the simulated data designed to resemble 
"moderately heterogeneous achievement tests and attitude assessment 
instruments" (p. 193), LOGIST still recovered the latent trait and provided 
acceptable estimates of the item parameters (p. 198). There is no reason to 
believe that the effects of multidimensionality on BILOG (Mislevy and Bock, 
1982), which was used to scale the NAEP data, would differ from the results 
obtained with LOGIST (Mislevy, personal communication, October, 1985). These 
findings suggest that IRT scaling procedures can produce satisfactory results 
under moderate departures from unidimensionality. 

2. Methods of dimensionality assessment for dichotomous data 

The traditional psychometric approach to the assessment of dimensionality 
is through factor-analytic methods. Factor analysis often produces 
satisfactory results when each of the variables is the score on a multi-item 
test. When each of the measures is the response to a dichotomously scored item, 
however, it is now well known that linear factor analysis of Pearson (phi) 
correlations does not, in general, yield a correct representation of the 
dimensionality of the item pool (see, e.g., Carroll, 1945, 1983; Hulin, 
Drasgow, and Parsons, 1983; McDonald and Ahlawat, 1974; Mislevy, in press). 
The fundamental problem is that in computing phi correlations, item responses 
are treated as true dichotomies. In applying a linear factor analysis model, 
we are hypothesizing that dichotomous variables are linear combinations of 
continuous latent variables with infinite range, a mathematical impossibility. 
In fact, the regression of a dichotomous item on a continuous latent variable 
must be nonlinear. The best linear approximation to the nonlinear regression 
will depend on the region in which the data are most dense (Mislevy, in press); 
that is, it will be related to the item mean, or difficulty (as defined in 
classical test theory). From this perspective, it is not surprising that 
linear factor analysis of dichotomous items often produces a second factor, 
typically called a difficulty factor, that is related to item difficulty, but 
appears to be unrelated to any substantive property of the items. There can, 
in fact, be more than one such spurious factor (as is the case for items that 
form a perfect Guttman scale), but ordinarily, only one is substantial in 
size. 

A related problem with the phi coefficient, which can be regarded as 
another manifestation of the departure from the assumptions of classical 
factor analysis, is that its magnitude is determined in part by the relative 
values of the means of the two variables, which in this case are the item 
difficulties. Regardless of the underlying relationship between the items, 
the phi coefficient can reach unity only if the two items have identical 
proportions correct. 
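
The ceiling on phi can be checked numerically. The sketch below is an illustration under assumed values, not a computation from the NAEP data: it computes the largest phi attainable for two items with given proportions correct, which occurs when everyone who passes the harder item also passes the easier one. The function names and the proportions used are hypothetical.

```python
import math


def phi(p_i, p_j, p_both):
    """Phi coefficient from two proportions correct and the joint proportion correct."""
    return (p_both - p_i * p_j) / math.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))


def phi_max(p_i, p_j):
    """Largest attainable phi: the joint proportion cannot exceed the smaller marginal."""
    return phi(p_i, p_j, min(p_i, p_j))


print(round(phi_max(0.5, 0.5), 3))  # equal difficulties: phi can reach 1
print(round(phi_max(0.5, 0.9), 3))  # unequal difficulties: ceiling is about 1/3
```

With proportions correct of .50 and .90, phi cannot exceed about .33 no matter how strongly the two items are related.
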

As an alternative to phi coefficients, tetrachoric correlations between 
items can be obtained. In computing tetrachorics, it is assumed that the item 
responses are functions of underlying continuous variables that have a 
bivariate normal distribution. The model dictates that, for each item, 
individuals who have values greater than a certain threshold on the underlying 
response variable get that item correct; individuals with values lower than 
the threshold get it wrong. Using the bivariate normality assumption, the 
correlation between the unobserved continuous variables can be inferred from 
the 2x2 table of item responses. Of course, tetrachoric correlations do not 
provide a valid measure of association if bivariate normality does not hold. 
Furthermore, the occurrence of guessing violates the above model, which 
postulates that the probability that an individual gets an item right is a 
function only of his value on the underlying response variable. When guessing 
does occur, factor analysis of tetrachorics can produce spurious factors (see 
Carroll, 1945, 1983; Hulin, Drasgow, and Parsons, 1983). Adjustments for 
guessing are theoretically possible, but often lead to unacceptable results in 
practice. (Attempts to adjust for the effects of guessing in the NAEP analyses 
are discussed in section 3.2.1.) Additional problems are inaccuracies in the 
computation of tetrachorics as they approach +1 or -1, the large standard 
errors of the coefficients, and the occurrence of non-Gramian matrices of 
sample tetrachorics, even when data are complete. (In the case of the NAEP 
analyses, in which a large proportion of data are missing by design, the 
negative eigenvalues tend to comprise a large proportion of the trace of the 
tetrachoric matrix; see section 3.1.2 and Table 3.) 
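
The threshold model described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the routine used for the NAEP analyses: it takes the thresholds from the marginal proportions correct, obtains the joint probability of two correct responses by integrating the bivariate normal density over the correlation (Simpson's rule), and solves for the correlation by bisection. The function names, the step count, and the search interval of -0.999 to 0.999 are choices made for the example, and all four marginal proportions are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    one_minus_r2 = 1.0 - r * r
    expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * one_minus_r2)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(one_minus_r2))


def prob_both_above(h, k, rho, steps=200):
    """P(X > h, Y > k): the value at rho = 0 plus the integral of the density over r."""
    nd = NormalDist()
    base = (1.0 - nd.cdf(h)) * (1.0 - nd.cdf(k))
    width = rho / steps  # Simpson's rule on [0, rho]; steps must be even
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for m in range(1, steps):
        total += (4 if m % 2 else 2) * bvn_density(h, k, m * width)
    return base + total * width / 3.0


def tetrachoric(n11, n10, n01, n00):
    """Tetrachoric correlation from a 2x2 table (n11 = both items correct)."""
    n = n11 + n10 + n01 + n00
    nd = NormalDist()
    h = nd.inv_cdf(1.0 - (n11 + n10) / n)  # threshold for item i
    k = nd.inv_cdf(1.0 - (n11 + n01) / n)  # threshold for item j
    p_both = n11 / n
    lo, hi = -0.999, 0.999
    for _ in range(60):  # the joint probability increases with rho
        mid = (lo + hi) / 2.0
        if prob_both_above(h, k, mid) < p_both:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical table: both items 50 percent correct, one third correct on both.
print(round(tetrachoric(200, 100, 100, 200), 3))  # close to 0.5
```

For this hypothetical table the phi coefficient is about .33, whereas the tetrachoric correlation is about .50. When both thresholds are zero, the joint proportion equals 1/4 + arcsin(rho)/(2 pi), which gives one third at rho = .5.
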



ERLC 



12 



It is clear that conventional factor analysis of phi and tetrachoric 
correlations is not a satisfactory means of investigating dimensionality. 
Unfortunately, no uniformly accepted statistical procedures for dimensionality 
assessment exist for the case of dichotomous variables. As a result, a vast 
literature on the subject has developed, particularly during the last ten 
years, as the use of IRT models has increased. The methods that have gained 
attention recently are briefly described here; more detailed reviews of 
dimensionality assessment are given by Hattie (1984, 1985), Hulin, Drasgow, 
and Parsons (1983, Chapter 8), and Mislevy (in press). 

Factor-analytic methods that have been proposed to overcome the problems 
described above include factor analysis of item parcels, nonlinear factor 
analysis, the generalized least squares methods developed by Christofferson 
(1975) and Muthen (1978), and the full-information maximum likelihood method of 
Bock (Bock, Gibbons, and Muraki, 1985). 

Factor analysis of item parcels is achieved by grouping items into 
meaningful subtests (the so-called parcels) and then applying conventional 
factor-analytic methods to the parcel scores. This method was applied by Cook 
and Eignor (1984) to a portion of the NAEP data collected in 1979-1980 and by 
Cook, Eignor, Dorans, and Petersen (1985) to SAT data. One practical problem 
with this approach is that it may be difficult to classify certain items a 
priori. Furthermore, if the item parcels differ in average difficulty, the 
obtained factor structure may be influenced to an undesirable degree by item 
difficulty, as in the dichotomous case (Kingston and Dorans, 1982). A more 
fundamental drawback is that this approach does not assess directly the 
properties of individual items. Because item scores do not enter the 
analysis, it is possible for items that measure a property other than the one 
of interest to go undetected. Finally, the application of this approach to 
the complete NAEP data set is virtually ruled out because examinees do not all 
receive the same items (see section 3.1). (The Cook and Eignor [1984] 
analysis was based on a subset of examinees who had been administered the same 
items.) 

In a series of publications, McDonald presented a theory of nonlinear 
factor analysis (e.g., McDonald, 1967, 1983). In McDonald's model, 
$P(x_{ij} = 1 \mid \boldsymbol{\theta})$, the conditional probability that an 
examinee answers an item correctly, given his observed vector of latent 
traits, $\boldsymbol{\theta}$, is expressed as a nonlinear function of the 
latent traits. For example, in one version of the model, 
$P(x_{ij} = 1 \mid \boldsymbol{\theta})$ is expressed as a weighted sum of 
polynomial functions of the latent traits. Simulation studies of the 
effectiveness of nonlinear factor analysis as a method of dimensionality 
assessment have led to inconsistent findings. Hambleton and Rovinelli (in 
press) found that a one-factor polynomial model with linear and quadratic 
terms provided a good fit to a simulated unidimensional data set, unlike a 
one-factor linear model. Furthermore, a two-factor polynomial model provided 
a good fit to two-dimensional simulated data. Based on this and other 
findings, Hambleton and Rovinelli concluded that nonlinear factor analysis is 
one of the most promising methods for assessing the dimensionality of 
dichotomous data. On the other hand, Hattie (1984) concluded that the sum of 
absolute residual covariances from nonlinear factor analysis was not an 
effective index of dimensionality because results from the unidimensional and 
multidimensional data sets were not sufficiently distinct. 
Christofferson (1975) developed a factor-analytic method for dichotomous 
data that involves expressing the expected proportion correct for each item 
and the joint proportions correct for each pair of items as a function of 
item thresholds (see above and section 3.4, below) and factor loadings. The 
weighted distance between the observed and modeled values of these proportions 
is then minimized using generalized least squares (GLS) methods. 
Christofferson's solution makes use of the information contained in the 
three- and four-way margins of the n-way contingency table of item responses 
(see Christofferson, 1975, Appendix 2; Mislevy, in press), unlike conventional 
factor analysis of phi or tetrachoric correlations, which makes use of only 
the one- and two-way marginals. Solving for estimates of the thresholds and 
loadings requires numerical integration and is therefore computationally 
burdensome. Muthen (1978) developed an alternative GLS method that reduces 
the computational requirements to some degree. However, application of both 
Christofferson's and Muthen's methods is currently limited to about 25 items. 

Bock developed a factor-analytic approach for dichotomous data, called 
full-information factor analysis (Bock, Gibbons, and Muraki, 1985) because it 
uses information contained in the joint frequencies of all orders of the item 
responses. This method, detailed in section 3.4 below, makes use of the 
marginal maximum likelihood methods of Bock and Aitkin (1981) for estimating 
the parameters of the common factor model. 

In addition to factor-analytic approaches, a number of other methods of 
dimensionality assessment have been proposed. For example, Bejar (1980) has 
recommended comparing the estimated item difficulties (i.e., the estimates of 
the $b_j$ of equation 1) obtained by calibrating a complete set of test items 
to those obtained by performing the calibration separately within content 
areas. (Bejar [1980] also proposed an additional procedure, which involves 
computing, for each content area, a scaled score corresponding to each of the 
two sets of item parameter estimates, and then comparing the results obtained 
by fitting a one-factor model to each of the two sets of scores.) Although 
Bejar's (1980) application of the method appeared to yield useful results, 
Hambleton and Rovinelli (in press) found that the method was unable to 
discriminate between one- and two-dimensional simulated data sets. Another 
method that has been proposed is analysis of the residual differences between 
observed responses and the estimated probabilities of correct responses 
according to the unidimensional item response model deemed appropriate (e.g., 
equation 1). Various methods of residual analysis have been proposed; reviews 
are given by Traub and Wolfe (1981) and Hattie (1985). The rationale is that 
if the model fits well, the data can be assumed to be consistent with 
unidimensionality. A major drawback is that large residuals may be the result 
of model violations other than multidimensionality. Hambleton and Rovinelli 
(in press) concluded that indices based on the size of average residuals 
obtained after fitting one-, two-, and three-parameter logistic models were 
not capable of detecting multidimensionality. It should be noted that 
Hambleton and Rovinelli did not report any investigation of the pattern of 
residuals. 

3. Methods used to assess the dimensionality of NAEP reading data 

The proposed methods of dimensionality assessment differ in terms of 
the assumptions needed, the hypothesis tested, and the statistical artifacts 
that affect interpretation. Rather than selecting a single method of 
dimensionality assessment for the NAEP reading data, we applied four different 
techniques, described in this section. For descriptive purposes, we included 
principal components analysis (PCA) of phi and tetrachoric correlations, as 
described in section 3.2. As an experimental analysis, we also applied PCA to 
the image correlation matrix, a method based on the work of Guttman (1953) and 
Kaiser and Cerny (1979), described in section 3.3. Bock's full-information 
factor analysis, discussed in section 3.4, was applied to a subset of the data. 
Finally, we used the method of Rosenbaum (1984a, 1984b), described in section 
3.5, which involves examination of the partial association for each pair of 
items, conditional on the total score on the remaining items. Prior to a 
discussion of these methods, the properties of the NAEP data base are 
described. 

3.1 Properties of NAEP data 

3.1.1 Items included in dimensionality analyses 

All reading items that were included in the IRT scaling and were also 
spiralled with other items (see section 3.1.2 and scaling report) were used in 
the dimensionality analyses. All subjects who responded to one or more of 
these items were included. The number of subjects and items available for the 
analyses is shown in Table 1. As indicated, there were about 100 items per 
grage. Twenty-five of the items included in the analyses were administered to 
all three grages. The range and mean of the proportions correct for each of 
the three grages and for the 25 across-grage items are given in Table 1. As 
shown, the number of students per grage was roughly 26 to 29 thousand, 
corresponding to weighted frequencies of over 3 million. As a result of the 
number of items and subjects in the data base, certain analyses were ruled out 
because they were too costly or exceeded computing capabilities. In other 
cases, dimensionality analyses were performed on only a subset of items to 
minimize the cost and the computational burden. 

Table 1 

Number of Items and Students Available for 
Dimensionality Analyses 

                   Number of    Proportions Correct       Number of Students 
Grage                Items    Minimum  Maximum  Mean    Unweighted     Weighted 

9/IV                  108       .04      .93    .50       26,087     3.5 million 
13/VIII               100       .09      .98    .63       28,405     3.3 million 
17/IX                  95       .21      .96    .70       28,861     3.4 million 

Across Grages          25       .13      .90    .53       83,353    10.2 million 
(Common Items) 

Ninety-four percent of the NAEP reading items included in the analyses were 
multiple choice items with three to six response choices. The remainder were 
essay items in which the respondent was asked to react to a reading passage. 
Essay items were scored on a scale of 1 to 5, which was later dichotomized. All 
items were classified by reading experts on the basis of objective (deriving 
information vs. integrating and applying information), stimulus (short or long 
reading passage, document, or picture), and content (fictional story, poem, 
informational passage, social studies, science, arts and humanities, or life 
skills). These item properties, as well as a further classification of the items 
based on the work of Mosenthal (1985), were used in attempting to interpret 
analysis results. (A subset of reading items that were designed to assess study 
skills were not included in the dimensionality analysis because they were not 
scaled using IRT. That these items differed from the remaining reading items was 
suggested by examination of the item content, as well as empirical evidence: For 
a subset of examinees, number-right scores on blocks of study skills items and on 
blocks of conventional reading items were obtained. The attenuation-corrected 
correlations between study skills blocks and conventional reading blocks tended to 
be lower than intercorrelations between conventional reading blocks. Many of the 
items which led to departures from unidimensionality in Jungeblut's [1984] analyses 
of the 1979-1980 NAEP data were study skills items [Jungeblut, personal 
communication, October, 1985].) 

3.1.2 Missing data pattern 

A new feature of the Year 15 NAEP design was the use of balanced 
incomplete block (BIB) spiralling to assign test items to booklets (see Messick, 
Beaton, and Lord, 1983; Beaton, 1984). BIB spiralling combines the features of 
conventional spiralling and multiple matrix sampling. As in ordinary multiple 
matrix sampling, each item is administered a prescribed number of times, 
although examinees receive different subsets of items. BIB spiralling has the 
additional feature that each pair of items is assessed a prescribed number of 
times. In NAEP, reading items were first grouped into blocks, consisting in 
most cases of 8 to 12 items, which were then assigned to test booklets according 
to a design that conformed to these criteria. This resulted in a set of 
approximately 60 different test booklets per grage, which were assigned to 
respondents in a random sequence. 
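
The two defining properties of a balanced incomplete block design can be checked mechanically. The sketch below uses a deliberately small, hypothetical design (7 item blocks, 3 blocks per booklet, built by cyclically shifting the base set {0, 1, 3}); it is not the NAEP booklet layout, whose details are not given here. It counts how often each block, and each pair of blocks, appears across booklets.

```python
from collections import Counter
from itertools import combinations


def cyclic_design(n_blocks, base):
    """Booklets obtained by shifting a base set of block labels modulo n_blocks."""
    return [tuple(sorted((x + shift) % n_blocks for x in base))
            for shift in range(n_blocks)]


# Hypothetical toy design: 7 blocks of items, 3 blocks per booklet.
booklets = cyclic_design(7, (0, 1, 3))

# How many booklets contain each block, and each pair of blocks?
block_counts = Counter(block for booklet in booklets for block in booklet)
pair_counts = Counter(pair for booklet in booklets
                      for pair in combinations(booklet, 2))

for booklet in booklets:
    print(booklet)
print("appearances per block:", sorted(set(block_counts.values())))
print("appearances per pair: ", sorted(set(pair_counts.values())))
```

In this toy design every block appears in three booklets and every pair of blocks appears together in exactly one booklet, so every pair of items is administered to some respondents and every inter-item correlation can be estimated.
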

A major advantage of BIB spiralling is that it permits the estimation of 
inter-item correlations. However, the resulting matrix of correlations, referred 
to here as the BIB matrix, has an unusual pattern of missing data. In the case 
of the NAEP reading data, the number of respondents available to estimate 
correlations between items in the same block is, in most cases, nine times the 
number of respondents available for the estimation of correlations between items 
that fall within different blocks. Furthermore, the correlations of items in one 
block, say, A, with those in another block, B, are not in general based on the 
same group of respondents as the correlations of Block C items with Block D 
items. Because of the spiralling procedure used to assign booklets to 
respondents, the missing data that result from the implementation of a BIB design 
can be regarded as random. However, in using a BIB correlation matrix rather 
than a conventional correlation matrix, we are implicitly making the assumption 
that the correlations between items are not subject to context effects. If, for 
example, the population correlation between two items, i and j, varied depending 
on whether k were administered with i and j, then the sample correlation of i 
and j in the presence of k would not be an estimate of the same population 
parameter as the sample correlation of i and j in the absence of k. Computation 
of a BIB matrix would involve averaging these sample correlations, which would 
be undesirable. 

Even if the assumption of no context effects is justified, there are 
other ways in which the properties of the BIB matrix differ from those of a 
conventional correlation matrix. For example, the standard errors of the 
within-block correlations are smaller than those of the between-block 
correlations. Also, the BIB matrix may have negative eigenvalues, unlike a 
conventional correlation matrix. As detailed in section 3.1 and Tables 2 and 
3, both phi and tetrachoric matrices of NAEP items had negative roots in most 
cases. For analyses that required a matrix that was at least positive 
semi-definite, an adjustment procedure, described in Appendix 1, was applied. 
Although there is no indication that analysis results were affected in any major 
way by the use of BIB matrices or their adjusted counterparts, the statistical 
properties of these matrices are not fully understood at present. 
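
To see how a matrix assembled from separately estimated correlations can have negative roots, consider the following sketch. The 3x3 matrix is hypothetical (it is not taken from the NAEP data): each entry is a legitimate correlation on its own, but the three values could not all arise from a single complete sample. The eigenvalues are obtained with a plain cyclic Jacobi routine written for this example, and the sum of the negative roots is expressed as a percent of the trace.

```python
import math


def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[p][q] ** 2 for p in range(n) for q in range(p + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the (p, q) element.
                theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.hypot(theta, 1.0))
                c = 1.0 / math.hypot(t, 1.0)
                s = t * c
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def negative_root_percent(eigenvalues):
    """Sum of the negative roots (in absolute value) as a percent of the trace."""
    return 100.0 * sum(-e for e in eigenvalues if e < 0) / sum(eigenvalues)


# Hypothetical pairwise correlations that are mutually inconsistent.
r = [[1.0, 0.9, 0.9],
     [0.9, 1.0, -0.9],
     [0.9, -0.9, 1.0]]
roots = jacobi_eigenvalues(r)
print([round(e, 3) for e in roots])            # roots are 1.9, 1.9, and -0.8
print(round(negative_root_percent(roots), 1))  # about 26.7 percent of the trace
```

The example is extreme by design; its purpose is only to show that unit diagonals and correlations within the range -1 to +1 do not guarantee a positive semi-definite matrix.
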

In addition to the BIB missing data, which can be regarded as random, 
there are two major categories of non-random missing data: omitted items and 
items that the respondent was administered but did not reach. Unanswered 
items occurring after the last valid response within a block were considered 
"not reached." (In administering the items, each block was timed separately.) 
Unanswered items that occurred prior to the last valid response (and were not a 
result of the BIB design) were coded as omits. The category of omitted items was 
defined to include as well any items marked "I don't know," which was a response 
alternative for all multiple choice items. The treatment of not reached and 
omitted items in each of the dimensionality analyses is discussed in section 3.2 
and the sections that follow. 

3.2 Principal component analysis of inter-item correlation matrices 

Despite the drawbacks described in section 2, principal component 
analyses (PCA) of the phi and tetrachoric matrices for each grage were 
conducted for descriptive purposes. In addition, analyses including all 
respondents were performed, based on the 25 items common to all three grages. 
It can be argued that the results of these analyses represent a "worst case"; 
that is, because the analyses tend to produce spurious factors, results that 
were free of artifacts would be expected to be more consistent with 
unidimensionality. 

Items that were not reached were excluded from the analysis; omitted items 
were scored as incorrect. For each of the four phi matrices, Table 2 gives the 
range of inter-item correlations, the median correlation, the first five 
eigenvalues and the percent of the trace they represent, and, as an index of the 
degree to which the matrix departed from positive-definiteness, the sum of the 
negative eigenvalues as a percent of the trace of the matrix. The range of 
sample sizes (N) on which the correlation coefficients were based (see section 
3.1.2) is also given. The corresponding information for the tetrachoric matrices 
is given in Table 3. The results in Tables 2 and 3 are based on analyses that 
incorporated the respondents' sampling weights (see Lago, Burke, Tepping, and 
Hansen, 1985). Unweighted analyses yielded almost identical results. 

It is clear that, for each of the eight matrices, there is a large first 
root, constituting between 17 and 25 percent of the trace for the phi matrices 
and between 30 and 40 percent for the tetrachoric matrices (but note that the 
negative roots constitute up to 27 percent of the trace for tetrachoric 
matrices). The second root is always less than one-fourth of the first. 
Following the sharp drop-off between the first and the second, the remaining 
roots trail off gradually. These findings are reassuring in that they are 
consistent with a large first dimension. (The size of the first component may 
appear small to those who are unaccustomed to examining the results of 
item-level factor analyses. In interpreting these findings, however, it is 
important to consider that the median inter-item correlations are low: 
between .14 and .19 for the four phi matrices and between .27 and .35 for the 
tetrachoric matrices. Results of PCA of phi matrices computed from simulated 
unidimensional data showed that the first root typically constituted 25 to 30 
percent of the trace; see section 3.3 and Table 5.) The loadings on the 
first principal component were not related in any obvious way to the item 
classifications discussed in section 3.1.1. 
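
The percentages quoted above follow directly from the tabled roots, because the trace of a correlation matrix equals the number of items. The short sketch below reproduces that arithmetic from the first two roots reported in Tables 2 and 3. The item count for grage 17/IX is taken as 95, the value given in Tables 1 and 3; the variable and function names are ours.

```python
# (matrix type, grage, number of items, first root, second root, reported percent)
ROOTS = [
    ("phi", "9/IV", 108, 23.9, 3.3, 22),
    ("phi", "13/VIII", 100, 17.0, 2.6, 17),
    ("phi", "17/IX", 95, 17.5, 3.1, 18),
    ("phi", "all grages", 25, 6.3, 1.5, 25),
    ("tetrachoric", "9/IV", 108, 39.5, 6.6, 37),
    ("tetrachoric", "13/VIII", 100, 30.0, 4.3, 30),
    ("tetrachoric", "17/IX", 95, 32.0, 3.9, 34),
    ("tetrachoric", "all grages", 25, 10.0, 1.6, 40),
]


def summarize(rows):
    """First root as a rounded percent of the trace, and the ratio of second to first root."""
    out = []
    for kind, grage, n_items, first, second, reported in rows:
        pct = 100.0 * first / n_items  # trace of a correlation matrix = number of items
        out.append((kind, grage, round(pct), reported, second / first))
    return out


for kind, grage, pct, reported, ratio in summarize(ROOTS):
    print(f"{kind:12s} {grage:11s} first root = {pct}% of trace "
          f"(reported {reported}); second/first = {ratio:.2f}")
```

Each computed percentage matches the tabled value after rounding, and every ratio of second to first root falls below one-fourth, the largest being about .24 for the phi matrix of the 25 common items.
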

3.2.1 Application of guessing corrections to tetrachoric 
correlations 

When it is possible for items to be answered correctly 
through guessing, the magnitude of observed tetrachoric correlations is 
related to item difficulty (e.g., see Hulin, Drasgow, and Parsons, 1983, 
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Table 2 

Eigenvalues and Descriptive Statistics for Phi Matrices 

Grage 9/IV (108 items) 
First 5 Roots Pet. of trace Descriptive Statistics 



23.9 22 Range of N 149, 5502 

3.3 3 

2.5 2 Range of r -.18, .53 
2. A 2 Median r .19 

2.2 2 Neg. roots as pet. of trice 3 

Grage 13/VIII (100 items) 

First 5 Roots Pet. of trace Descriptive Statistics 

17.0 17 Range of N 160, 4502 

2.6 3 

2.5 2 Range of r -.15, .60 

2.2 2 Median r .14 

2.1 2 Neg. roots as pet. of trace 2 

Gra^e I7/IX (55 items) 

First 5 Roots Pet. of trace Descriptive Statistics 

17.5 18 Range of N 167, 4659 

3.1 3 

2.3 2 Range of r -.16, .68 

2.1 2 Median r .16 

2.0 2 Neg. roots as pet. of trace 2 

All Grages Combined (25 items) 

First 5 Roots Pet. of trace Descriptive Statistics 

6.3 25 Range of N 607, 8862 

1.5 6 

1.2 5 Range of r .23, .57 

1.1 5 . Median r .18 

1.0 4 Neg. roots as pet. of trace 0 
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Table 3 

Eigenvalues and Descriptive Statistics for Tetrachoric Matrices 

Grage 9/IV (108 items) 
First 5 Roots Pet. of trace Descriptive Statistics 



39.5 37 Range of N 149, 5502 

6.6 6 

4.7 4 Range of r -.46, .81 

3.7 3 Median r .35 

3.4 3 Neg. roots as pet. of trace 27 

Orage 13/VIII (100 items) ' 

First 5 Roots Pet. of trace . Descriptive Statistics 

30.0 30 Range of N 160, 4502 

4.3 4 

3.8 4 Range of r -.34, .81 

3.4 3 Median r .27 

3o3 3 Neg. roots as pet. of trace 21 

Grage 17/IX (95 items) 

First 5 Roots Pet. of trace Descriptive Statistics 

32.0 34 Range of N 167, 4659 

3.9 4 

3.3 3 Range of r -.38, 90 

3.0 3 Median r .31 

2.8 3 Neg. roots as pet* of trace 19 



All Grages Combined (25 items) 

First 5 Roots Pet. of trace Descriptive Statistics 

10.0 40 Range of N 607, 8862 

1.6 6 

1.2 5 Range of r .05, .80 

1.2 5 Median r .33 

1.0 4 Neg. roots as pet. of trace 0 
pp. 249-255). To eliminate this problem, Carroll (1945) suggested that the
frequencies in the 2 x 2 tables of responses for each pair of items be
adjusted to "remove" the effects of guessing and that tetrachorics be computed
on the basis of these adjusted frequencies. In Carroll's model, it is assumed
that guessing is random and that the probability of getting an item right by
guessing is therefore equal to the reciprocal of the number of response
choices. It is also implicitly assumed that, for each pair of items, the
probability of getting one item right by guessing is independent of the
probability of making a correct guess on the other item. To determine whether
it would be a useful strategy for NAEP data, Carroll's correction was applied
to the item responses for grage 13/VIII, setting $g_j$, the hypothetical
probability of guessing right on item j, equal to the reciprocal of the number
of response choices for item j, excluding the "I don't know" alternative. For
essay items, $g_j$ was set to 0. The results were clearly unsatisfactory:
16 percent of the tetrachoric coefficients were rendered
incomputable because of negative adjusted cell frequencies. Several other
corrections were investigated but deemed unsatisfactory, including a
modification of Carroll's correction in which the input $g_j$ values were
adjusted so as to avoid the occurrence of negative adjusted cell frequencies
and a correction in which each $g_j$ was set equal to the estimated lower
asymptote, $c_j$ (see equation 1), of the item from the IRT item calibration.
(Note that Bock, Gibbons, and Muraki [1985] describe a modification of
Carroll's correction that apparently produces satisfactory results.)
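The way negative adjusted cell frequencies can arise is easy to see in a small numerical sketch. The code below is not taken from Carroll (1945); it simply inverts the random-guessing model as stated above (guessing probabilities g_j and g_k, independent across the two items). The function names, the table layout, and the example proportions are all illustrative assumptions.

```python
import numpy as np

def observed_from_true(true, g_j, g_k):
    """Forward model: 2x2 'knowledge' table -> observed 2x2 table.

    Tables are indexed [item j][item k], with 1 = right (knows) and
    0 = wrong (does not know). An examinee who does not know an item
    answers it correctly with probability g, independently across items.
    """
    t = np.asarray(true, dtype=float)
    obs = np.zeros((2, 2))
    obs[0, 0] = t[0, 0] * (1 - g_j) * (1 - g_k)
    obs[1, 0] = (t[1, 0] + t[0, 0] * g_j) * (1 - g_k)
    obs[0, 1] = (t[0, 1] + t[0, 0] * g_k) * (1 - g_j)
    obs[1, 1] = t.sum() - obs[0, 0] - obs[1, 0] - obs[0, 1]
    return obs

def guessing_adjusted_table(observed, g_j, g_k):
    """Invert the forward model to 'remove' guessing from an observed table."""
    o = np.asarray(observed, dtype=float)
    adj = np.zeros((2, 2))
    adj[0, 0] = o[0, 0] / ((1 - g_j) * (1 - g_k))
    adj[1, 0] = o[1, 0] / (1 - g_k) - adj[0, 0] * g_j
    adj[0, 1] = o[0, 1] / (1 - g_j) - adj[0, 0] * g_k
    adj[1, 1] = o.sum() - adj[0, 0] - adj[1, 0] - adj[0, 1]
    return adj

# Round trip on a hypothetical table of proportions (four-choice items).
true_table = np.array([[0.2, 0.1], [0.3, 0.4]])
recovered = guessing_adjusted_table(
    observed_from_true(true_table, 0.25, 0.25), 0.25, 0.25)

# A hypothetical observed table that the model cannot reproduce:
# too few (right, wrong) responses for the assumed amount of guessing.
bad_adjusted = guessing_adjusted_table([[0.5, 0.2], [0.05, 0.25]], 0.25, 0.25)
print(bad_adjusted)  # one cell is negative, so no tetrachoric can be computed
```

Whenever an observed cell is smaller than the amount that guessing alone would be expected to contribute, the adjusted frequency falls below zero, which is the situation reported above for 16 percent of the coefficients.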
3.3 Principal components analysis of the image correlation matrix

Guttman (1953) developed a theory for the structure of quantitative
variates called image theory. Image theory is based on the partitioning of a
variable into two additive segments: the part that can be predicted through
least squares linear regression of that variable on all the remaining variables,
called the image, and the error of prediction, called the anti-image. Thus,
unlike common factor theory, image theory provides an explicit definition for
the common part of a variable. Another difference from the traditional
factor-analytic approach is that the anti-images may have non-zero covariances.
Guttman shows that common factor theory may be viewed as a special case of image
theory. The relation between image theory and other factor-analytic approaches
is further examined by Harris (1962) and reviewed by Mulaik (1972).

Suppose that n variables are to be observed. The decomposition of the
original variates into images and anti-images can be expressed as

$$\mathbf{z} = \mathbf{v} + \mathbf{u} \qquad [2]$$

where z is the n x 1 vector of observable random variables, standardized to
have mean zero and unit variance, v is the n x 1 vector random variable of
images defined in equation 3, below, and u is the n x 1 vector random variable of
anti-images, or errors of prediction. (When referring to a finite sample of
variables, Guttman used the terms partial image and partial anti-image. The
qualifier "partial" will not be used here.) The n x 1 vector random variable
v of images can be expressed as

$$\mathbf{v} = \mathbf{W}\mathbf{z} \qquad [3]$$
The weight matrix W is defined as

$$\mathbf{W} = \mathbf{I} - \mathbf{S}^2\mathbf{R}^{-1} \qquad [4]$$

where R is the correlation matrix of the original variates, z, and

$$\mathbf{S}^2 = [\operatorname{diag}(\mathbf{R}^{-1})]^{-1} \qquad [5]$$

The off-diagonals of W contain the regression weights for predicting each of
the variates z from the remaining n - 1 variates. The diagonals of W are equal
to zero because the regression of a variate on itself is not of interest.
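The two properties of W just described can be checked numerically. The following is a minimal sketch using NumPy; the 3 x 3 correlation matrix is a made-up example, and the function name is an assumption introduced only for illustration.

```python
import numpy as np

def image_weights(R):
    """Return W = I - S^2 R^{-1} and S^2 = [diag(R^{-1})]^{-1} (equations 4 and 5)."""
    R_inv = np.linalg.inv(R)
    S2 = np.diag(1.0 / np.diag(R_inv))
    W = np.eye(R.shape[0]) - S2 @ R_inv
    return W, S2

# Hypothetical positive definite correlation matrix for three variates.
R = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
W, S2 = image_weights(R)

# Ordinary least squares weights for predicting variate 0 from variates 1 and 2,
# obtained directly from the normal equations R_22 b = r_21.
b0 = np.linalg.solve(R[1:, 1:], R[1:, 0])

print(np.diag(W))      # numerically zero
print(W[0, 1:], b0)    # the two sets of weights agree
```

Row j of W therefore holds the standardized regression weights for variate j, which is why v = Wz gives the least squares predicted values.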

The principles of image theory are usually applied in practice by
factor-analyzing G, the covariance matrix of the images, given by

$$
\begin{aligned}
\mathbf{G} &= E(\mathbf{v}\mathbf{v}') = E[(\mathbf{W}\mathbf{z})(\mathbf{W}\mathbf{z})'] \\
           &= E(\mathbf{W}\mathbf{z}\mathbf{z}'\mathbf{W}') = \mathbf{W}\,E(\mathbf{z}\mathbf{z}')\,\mathbf{W}' \\
           &= \mathbf{W}\mathbf{R}\mathbf{W}' = (\mathbf{I} - \mathbf{S}^2\mathbf{R}^{-1})\,\mathbf{R}\,(\mathbf{I} - \mathbf{S}^2\mathbf{R}^{-1})' \\
           &= \mathbf{R} + \mathbf{S}^2\mathbf{R}^{-1}\mathbf{S}^2 - 2\mathbf{S}^2 \qquad [6]
\end{aligned}
$$

The jth diagonal element of this matrix is the variance of the jth image,
which is equal to the squared multiple correlation coefficient (SMC) obtained
by regressing the jth variate on the remaining n - 1 variates. In this sense,
G resembles the "reduced correlation matrix" of common factor analysis with
SMCs used as communality estimates. The off-diagonals of G, however, tend to
be slightly smaller than those of the reduced correlation matrix (Kaiser,
1963); furthermore, G is always Gramian (assuming data are complete), unlike a
correlation matrix with SMCs inserted in the diagonal.

As an alternative to the analysis of the G matrix, Kaiser and Cerny
(1979) recommended principal component analysis of the image correlation
matrix, G*, given by

$$\mathbf{G}^* = \mathbf{D}^{-1/2}\,\mathbf{G}\,\mathbf{D}^{-1/2} \qquad [7]$$

where

$$\mathbf{D} = \operatorname{diag}(\mathbf{G}) = \mathbf{I} - \mathbf{S}^2 \qquad [8]$$
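Equations 4 through 8 can be strung together in a few lines. The sketch below assumes a complete-data, positive definite R (so that the ordinary inverse exists); the one-factor example matrix and the function name are illustrative assumptions, not NAEP values.

```python
import numpy as np

def image_matrices(R):
    """Image covariance matrix G (eq. 6) and image correlation matrix G* (eq. 7)."""
    n = R.shape[0]
    R_inv = np.linalg.inv(R)
    S2 = np.diag(1.0 / np.diag(R_inv))          # eq. 5
    W = np.eye(n) - S2 @ R_inv                  # eq. 4
    G = W @ R @ W.T                             # eq. 6
    d_inv_sqrt = 1.0 / np.sqrt(np.diag(G))      # diagonal of D^{-1/2}, eq. 8
    G_star = G * np.outer(d_inv_sqrt, d_inv_sqrt)
    return G, G_star, S2

# Hypothetical correlation matrix generated by a single common factor.
loadings = np.array([0.8, 0.7, 0.6, 0.5])
R = np.outer(loadings, loadings)
np.fill_diagonal(R, 1.0)

G, G_star, S2 = image_matrices(R)
roots = np.linalg.eigvalsh(G_star)[::-1]        # eigenvalues, largest first
print(roots / roots.sum())                      # proportions of trace
```

The diagonal of G equals 1 minus the diagonal of S^2, that is, the SMCs, and G* has a unit diagonal, so its roots sum to the number of variates just as they do for an ordinary correlation matrix.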

Kaiser (1970; see also Kaiser and Cerny, 1979) conjectured that image
analysis would be well-suited to the factor analysis of dichotomous data. He
noted that because the images are least squares predicted values of one
variate based on the remaining n - 1 variates, "a crude appeal to the Central
Limit Theorem suggests that the images will be sensibly multivariate normal, a
set-up which is well known not to produce difficulty factors" (Kaiser, 1970,
p. 407). Although McDonald and Ahlawat (1974) expressed doubt about the
utility of this approach, some unpublished work by Meredith (personal
communication, September, 1985) provided partial confirmation of Kaiser's
conjecture.

As an experimental approach to dimensionality assessment, principal
component analysis of the image correlation matrix was applied to the NAEP
data for grages 9/IV, 13/VIII, and 17/IX and to the 25 across-grage items.
Modification of the standard equations of image analysis was required because,
in the case of NAEP data, the matrix R of weighted phi correlations is not
positive definite (see section 3.2 and Table 2) and therefore cannot be
inverted. An adjustment procedure, detailed in Appendix 1, was used to obtain
a singular approximation to the matrix of inter-item correlations and a
pseudo-inverse of this adjusted matrix. The pseudo-inverse
matrix was then substituted for $\mathbf{R}^{-1}$ in the formulas for W and $\mathbf{S}^2$ (equations
4 and 5), as recommended by Kaiser and Cerny (1978). Analogues of the
matrices G, G*, and D (equations 6, 7, and 8) were computed using these
modified forms of W and $\mathbf{S}^2$.
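The adjustment procedure itself is given in Appendix 1 and is not reproduced in this section. Purely as a generic illustration of the idea, and not as the procedure actually used, the sketch below shows one common way to obtain a singular, positive semidefinite approximation to an indefinite matrix (dropping non-positive eigenvalues) together with its pseudo-inverse. The function name, the tolerance, and the 3 x 3 example are assumptions.

```python
import numpy as np

def psd_approximation_and_pinv(R, tol=1e-10):
    """Drop non-positive eigenvalues of a symmetric matrix.

    Returns the resulting singular, positive semidefinite approximation
    and its Moore-Penrose pseudo-inverse. This is an illustrative choice,
    not the Appendix 1 procedure.
    """
    vals, vecs = np.linalg.eigh(R)
    keep = vals > tol
    R_adj = (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T
    R_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    return R_adj, R_pinv

# Hypothetical matrix of pairwise-computed "correlations" that is not
# positive definite (its eigenvalues are 1.9, 1.9, and -0.8).
R = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])
R_adj, R_pinv = psd_approximation_and_pinv(R)
print(np.linalg.eigvalsh(R))      # contains a negative root
print(np.linalg.eigvalsh(R_adj))  # no negative roots; one root is zero
```

Note that an approximation of this kind no longer has a unit diagonal, so any practical procedure needs some further convention for rescaling; that detail is left open here.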

The first five roots of the image correlation matrix are given in Table 4
for the three grages and for the across-grage analysis. For the three
within-grage analyses, the first roots are between 14 and 47 percent larger than
those for the Pearson matrix. There are at least two possible reasons for
this. One distinction between the two PCA methods, which applies regardless
of whether the data are dichotomous, is that the PCA of the Pearson matrix
involves the correlations of observed values on the original variates z
(equation 2), whereas PCA of the G* matrix involves the correlations of the
common parts, v, of the items as defined in equations 2-5. This difference
would be expected to result in larger first roots for the image approach.
Furthermore, in the present application of image analysis, the problems
associated with linear factor analysis of dichotomous data are to some degree
ameliorated by using a matrix of correlations between weighted sums of
dichotomous item scores. This, of course, was the basis for Kaiser's
conjecture that the image approach would work well in the dichotomous case.
It is somewhat surprising that the second roots are also substantially larger
for the image matrix than for the Pearson matrix. This is most obvious in
grage 13/VIII, where the second root of the image correlation matrix is more
than three times as large as the second root of the Pearson matrix.

Table 4

Eigenvalues of the Image Correlation Matrix

Grage 9/IV (108 items)

First 5 Roots   Pct. of trace
    27.3             25
     9.5              9
     3.7              3
     3.2              3
     2.7              3

Grage 13/VIII (100 items)

First 5 Roots   Pct. of trace
    23.2             23
     9.5              9
     3.9              4
     2.8              3
     2.6              3

Grage 17/IX (95 items)

First 5 Roots   Pct. of trace
    25.8             27
     5.7              6
     4.3              4
     3.4              4
     3.3              3

All Grages Combined (25 items)

First 5 Roots   Pct. of trace
    18.0             72
     2.0              8
     1.1              5
     0.7              3
     0.6              2

For each of the three within-grage analyses, a solution conforming to the
principles of simple structure could be obtained using Promax rotation
(Hendrickson and White, 1964). In order to interpret the factors, the
relation between the loadings for the rotated solutions and the
classifications of reading items described in section 3.1.1 was examined. No clear
pattern emerged, however. Furthermore, for items that were administered to
more than one grage, there was no consistency across grages in the
configurations of loadings.

Results for the 25 items that were administered to all three grages were
substantially different from the within-grage analyses. The first root of the
image correlation matrix constituted more than seventy percent of the trace, a
finding that appears consistent with unidimensionality. The first root was
nearly three times the size of the first root of the Pearson matrix; the
second root grew only slightly in this case. It is likely that results of
this analysis differed from those of the within-grage analysis because the
across-grage correlation matrix was better-behaved. The sample sizes were
larger and there were no negative correlations or negative roots.
To aid in interpreting the results of the four image analyses, PCA of the
image correlation matrix was applied to several simulated data sets generated
from a unidimensional model. The simulation studies were conducted as
follows: (1) Assuming a three-parameter logistic model, NAEP reading items
were calibrated with the LOGIST program (M. S. Wingersky, 1983) using actual
NAEP data. Thirty of these items were randomly selected for this simulation
run. (2) One thousand pseudo-random values from a normal distribution with
mean zero and unit variance were then generated. These represent theta or
proficiency values for N = 1000 examinees. (3) For each examinee, the
three-parameter logistic function (equation 1) was used to obtain the n x N =
30 x 1000 values of $P_{ij}$, the probability that person i gets item j correct.
The item parameters $a_j$, $b_j$, and $c_j$ were obtained from step 1 and the
theta values from step 2. (4) Corresponding to each value of $P_{ij}$, a pseudo-random
value $U_{ij}$ was generated from a uniform distribution on the interval [0,1]. If
$U_{ij}$ was less than $P_{ij}$, item j was scored as correct for person i; otherwise it
was scored as incorrect. The correlation matrix of these simulated data was
then obtained and the image procedure applied.
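Steps 2 through 4 can be sketched as follows. The item parameters below are drawn at random purely for illustration; they are not the LOGIST estimates from step 1. Equation 1 is not reproduced in this section, so the usual three-parameter logistic form with scaling constant D = 1.7 is assumed.

```python
import numpy as np

def simulate_3pl(a, b, c, n_examinees, rng, D=1.7):
    """Simulate dichotomous responses under a unidimensional 3PL model."""
    theta = rng.standard_normal(n_examinees)                 # step 2
    z = D * a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1 - c[None, :]) / (1 + np.exp(-z))     # step 3: P_ij
    u = rng.uniform(0.0, 1.0, size=p.shape)                  # step 4: U_ij
    return (u < p).astype(int), theta

rng = np.random.default_rng(0)
n_items = 30
a = rng.uniform(0.6, 1.6, n_items)    # hypothetical discriminations
b = rng.uniform(-1.5, 1.5, n_items)   # hypothetical difficulties
c = np.full(n_items, 0.2)             # hypothetical lower asymptotes

X, theta = simulate_3pl(a, b, c, 1000, rng)
phi = np.corrcoef(X, rowvar=False)    # phi = Pearson r for 0/1 scores
roots = np.linalg.eigvalsh(phi)[::-1]
print(roots[:5] / n_items)            # first five roots as proportions of trace
```

Feeding `phi` into the image equations of section 3.3 would then give the simulated counterpart of the image analysis; because the parameters here are invented, the roots will not reproduce Table 5.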

Table 5 shows the first five roots of the phi and image correlation
matrices for one of the simulated data sets. Results were much more dramatic than
for the within-grage analyses of the actual NAEP data; the findings bore a closer
resemblance to the across-grage analysis of 25 items. Whereas the first root of
the phi matrix was only about one quarter of the trace in the simulation, the
first root of the image correlation matrix was about 80 percent of the trace.
Other simulated unidimensional data sets produced similar values. If the size of
the first root is used as a criterion, the image analysis technique is superior
to PCA of the phi matrix in revealing the true unidimensional structure
underlying the data. However, as in the case of the phi matrix, the loadings
of items on the second principal component of the image correlation matrix
have substantial correlations with the proportions correct for the items: the
correlations were .85 for the phi matrix and .65 for the image correlation
matrix. Because it is evident that the results of the PCA of the image
correlation matrix are not free of statistical artifacts, no further attempt
was made to interpret the Promax solutions. (It should also be noted that no
simulation studies of the performance of image analysis under
multidimensionality were conducted.)

Table 5

First Five Eigenvalues of Correlation and Image
Correlation Matrices for Simulation Data
(30 items with NAEP item parameters)

        Phi Matrix                 Image Correlation Matrix
First 5 Roots   Pct. of Trace   First 5 Roots   Pct. of Trace
     7.7             26             23.8             79
     1.7              6              2.6              9
     1.1              4              0.5              2
     1.0              3              0.5              2
     1.0              3              0.4              1

Correlation of Loadings on Second Principal
Component with Proportions Correct

        .85                             .65

3.4 Bock's full-information factor analysis

Another factor-analytic method that was applied to the NAEP data is
Bock's full-information factor analysis (Bock, Gibbons, and Muraki, 1985; see
also Mislevy, in press), which is implemented in the TESTFACT program (Wilson,
Wood, and Gibbons, 1983). Unlike the methods described in sections 3.2 and
3.3, this method does not require the computation of correlation coefficients,
but operates instead on the n-way contingency table of item responses. In
contrast to factor analysis of correlation coefficients, which makes use of
only the pairwise joint frequencies of item responses, Bock's full-information
solution uses information contained in the joint frequencies of all orders.
In applying this method, a particular model for the item responses must be
assumed. In the case of the NAEP data, the selected model was a multivariate
generalization of the three-parameter normal ogive in which each item is
allowed to load on multiple factors. The model can be developed by first
assuming that underlying the response of person i to item j is a response
process variable defined as

$$y_{ij} = \sum_{k=1}^{K} \lambda_{jk}\,\theta_{ki} + \nu_j \qquad [9]$$

where $\theta_{ki}$ represents the value of the kth latent variable (factor), k =
1, 2, ..., K, for the ith individual, i = 1, 2, ..., N, $\lambda_{jk}$ is the
loading of the jth item, j = 1, 2, ..., n, on the kth latent variable, and
$\nu_j$ is a residual term associated with item j. The observed score of the
ith examinee on the jth item, $x_{ij}$, takes on a value of 1, indicating a
correct score, if $y_{ij}$ exceeds $\gamma_j$, the threshold for the jth item. If
it is assumed that the residuals $\nu_j$ are independently distributed as
$N(0, \sigma_j^2)$, the conditional probability that the ith examinee gets the jth item
correct, given that his values on the latent variables are equal to $\theta_i$,
can be expressed as

$$P(x_{ij} = 1 \mid \theta_i) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \int_{\gamma_j}^{\infty} \exp\left[-\frac{1}{2}\left(\frac{y - \sum_{k=1}^{K}\lambda_{jk}\theta_{ki}}{\sigma_j}\right)^{2}\right] dy = F_j(\theta_i) \qquad [10]$$

This is a multivariate generalization of the two-parameter normal ogive model
(see Lord and Novick, 1968).
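The integral in equation 10 is the upper-tail area of a normal density, so it reduces to a standard normal distribution function evaluated at the standardized distance between the item's linear composite and its threshold. The sketch below checks that closed form against a brute-force numerical integration; the loadings, factor scores, threshold, and residual standard deviation are made-up values, and the function names are assumptions.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_correct(loadings, theta, threshold, sigma):
    """Closed form of equation 10: Phi((sum_k lambda_jk theta_ki - gamma_j) / sigma_j)."""
    mu = sum(l * t for l, t in zip(loadings, theta))
    return normal_cdf((mu - threshold) / sigma)

def prob_correct_numeric(loadings, theta, threshold, sigma, n_steps=20000):
    """Midpoint-rule evaluation of the integral in equation 10, for checking only."""
    mu = sum(l * t for l, t in zip(loadings, theta))
    upper = max(threshold, mu) + 10.0 * sigma   # the tail beyond this is negligible
    h = (upper - threshold) / n_steps
    total = 0.0
    for s in range(n_steps):
        y = threshold + (s + 0.5) * h
        total += math.exp(-0.5 * ((y - mu) / sigma) ** 2)
    return total * h / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical two-factor item and examinee.
p_closed = prob_correct([0.6, 0.3], [1.0, -0.5], 0.2, 0.8)
p_numeric = prob_correct_numeric([0.6, 0.3], [1.0, -0.5], 0.2, 0.8)
print(p_closed, p_numeric)
```

In practice only the closed form would be used; the numerical version is included to make the link to the integral explicit.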
This model can be modified to allow for the possibility of guessing by
substituting

$$F_j^*(\theta_i) = c_j + (1 - c_j)\,F_j(\theta_i) \qquad [11]$$

for $F_j(\theta_i)$, where $c_j$ represents the probability that an individual
with very low ability gets the item correct. This multivariate generalization
of the three-parameter normal ogive model was applied in the NAEP analyses.
The $c_j$ parameters were estimated using BILOG (Mislevy and Bock, 1982) and
then input to the TESTFACT program. NAEP items that were coded as "not
reached" (see section 3.1.2) were not included in the analysis. Omitted
items, on the other hand, were scored correct with probability $c_j$. Under
this strategy, examinees who omit an item have the same theoretical probability
of getting the item correct as examinees who guess in the absence of any
information.

Incorporating the item response function, $F_j^*(\theta)$, defined in
Equation 11, the marginal probability of the sth response pattern can be
expressed as:

$$P_s = P(\mathbf{x} = \mathbf{x}_s) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{j=1}^{n} \left[F_j^*(\theta)\right]^{x_{sj}} \left[1 - F_j^*(\theta)\right]^{1 - x_{sj}} f(\theta)\, d\theta \qquad [12]$$

where $x_{sj}$ is the response to the jth item in the sth response pattern, s = 1,
2, ..., S, and $S \le \min(2^n, N)$ is the number of response patterns. It is
further assumed in this application that $f(\theta)$ is the multivariate normal
distribution with mean 0 and covariance matrix I. Now, if it is assumed that
the counts of the distinct response patterns follow a multinomial distribution,
the likelihood of the matrix X of observed counts $r_s$ of distinct response
patterns can be expressed as:

$$P(\mathbf{X}) = \frac{N!}{r_1!\, r_2! \cdots r_S!}\; P_1^{r_1} P_2^{r_2} \cdots P_S^{r_S} \qquad [13]$$

where $P_s$ is given by Equation 12.

The quantities $P_s$ are estimated using numerical integration techniques.
The marginal maximum likelihood method of Bock and Aitkin (1981), which is
based on earlier work by Bock and Lieberman (1970), is then applied to
Equation 13 to obtain estimates of the factor loadings and thresholds for each
item (see Bock, Gibbons, and Muraki, 1985; Mislevy, in press).
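To make the numerical integration step concrete, the sketch below approximates $P_s$ in equation 12 for the one-factor case (K = 1) by Gauss-Hermite quadrature. It is not the TESTFACT algorithm; the item parameters are invented, the residual standard deviation is fixed by the assumption that each response process has unit variance, and the function name is hypothetical.

```python
import itertools
import math
import numpy as np

_erf = np.vectorize(math.erf)

def marginal_pattern_prob(pattern, lam, gamma, c, n_points=40):
    """Quadrature approximation to P_s (eq. 12) for K = 1 and theta ~ N(0, 1)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    weights = weights / math.sqrt(2.0 * math.pi)     # now sums to 1 (N(0,1) weights)
    sigma = np.sqrt(1.0 - lam ** 2)                  # assumption: Var(y_ij) = 1
    zscore = (np.outer(nodes, lam) - gamma) / sigma  # shape (points, items)
    F = 0.5 * (1.0 + _erf(zscore / math.sqrt(2.0)))  # eq. 10
    F_star = c + (1.0 - c) * F                       # eq. 11
    x = np.asarray(pattern)
    likelihood = np.prod(np.where(x == 1, F_star, 1.0 - F_star), axis=1)
    return float(weights @ likelihood)

# Hypothetical three-item test.
lam = np.array([0.8, 0.7, 0.6])
gamma = np.array([-0.5, 0.0, 0.5])
c = np.array([0.20, 0.20, 0.25])

probs = {s: marginal_pattern_prob(s, lam, gamma, c)
         for s in itertools.product([0, 1], repeat=3)}
print(sum(probs.values()))   # the 2^n pattern probabilities sum to 1
```

With K factors the same idea applies on a K-dimensional grid of quadrature points, which is one reason the computing cost grows so quickly with the number of factors.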

If sample size is sufficiently large, a test of the fit of the K-factor
model relative to a general multinomial alternative can be obtained using a
chi-square approximation to the likelihood ratio test. The model can be
re-estimated and the test repeated for successive values of K. The difference
between these chi-square statistics is also distributed as chi-square (under
the hypothesis that the more restrictive model is correct) and can be used to
test the improvement in model fit that is achieved by allowing the number of
latent variables to increase. The test of change in model fit has been shown
to perform well even when the frequency table is sparse (Haberman, 1977).
Because the TESTFACT program is very expensive to run, full-information
factor analysis was applied only to 42 items for grage 13/VIII. These items,
which were chosen to maximize the chances of detecting multidimensionality,
were intended to represent four distinct item types: reading comprehension,
vocabulary, life skills, and essay. The comprehension, vocabulary, and essay
items all referred to passages the examinee was asked to read. Some passages
were fictional stories; others pertained to an academic content area, such as
science or social studies. The life skills items were based on documents that
might be encountered in everyday life, such as a portion of a telephone
directory, a grocery store coupon, or an advertisement.

The analysis was based on the raw rather than the weighted frequency
table of item responses. Because sampling weights have little effect on
variances and covariances, they are unlikely to have much effect on factor
analysis results (Bock, personal communication, November, 1985).

In applying the chi-square test for the number of latent variables or
factors, it was necessary to take into account the effects of multistage cluster
sampling (see Lago et al., 1985) on the variability of the test statistic. In
adjusting the significance tests, it was assumed that the design effect
was equal to two. Research conducted with previous NAEP surveys led to the
conclusion that this was a reasonable estimate of the design effect for this type
of test statistic (Johnson, 1980). This means that an estimate of the variability
of the test statistic under the NAEP sample design can be obtained by computing
the variance of the statistic under simple random sampling assumptions and then
multiplying the obtained value by two. Because the log likelihood chi-square
statistics are proportional to sample size, a design effect can be incorporated
simply by dividing the chi-square values by the design effect. Incorporating
this adjustment, the chi-square test corresponding to the change from the one- to
the two-factor solution was not significant, indicating that the one-factor
solution could be retained. The single factor accounted for about 39 percent of
the total variance. Reading comprehension items, particularly those that
involved fictional stories, tended to have the highest factor loadings. Life
skills items had the lowest loadings.
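The design-effect adjustment amounts to a single division. The sketch below uses invented chi-square values and an assumed 40 degrees of freedom for the change in fit (neither is reported in this section); the critical value 55.76 is the tabled upper .05 point of chi-square with 40 degrees of freedom. Function names are hypothetical.

```python
def design_adjusted_chisq(chisq_srs, design_effect=2.0):
    """Divide a simple-random-sampling chi-square by the design effect."""
    return chisq_srs / design_effect

def improvement_is_significant(chisq_k, chisq_k_plus_1, critical_value,
                               design_effect=2.0):
    """Test the change in fit from K to K + 1 factors after adjustment."""
    change = chisq_k - chisq_k_plus_1
    return design_adjusted_chisq(change, design_effect) > critical_value

# Hypothetical fit statistics for the one- and two-factor models.
chisq_one_factor, chisq_two_factor = 5090.0, 5000.0
critical_value_df40 = 55.76   # upper .05 point of chi-square, 40 df (assumed df)

unadjusted = improvement_is_significant(
    chisq_one_factor, chisq_two_factor, critical_value_df40, design_effect=1.0)
adjusted = improvement_is_significant(
    chisq_one_factor, chisq_two_factor, critical_value_df40, design_effect=2.0)
print(unadjusted, adjusted)
```

In this invented example the change of 90 would be judged significant under simple random sampling but not after halving, which illustrates how ignoring the cluster sample could lead to retaining too many factors.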

3.5 Rosenbaum's test of unidimensionality, monotonicity, and conditional
independence

Rosenbaum (1984a) proves a theorem that states that if item
characteristic curves are nondecreasing functions of a single latent variable,
then conditional (local) independence of item responses, given the latent
variable, implies certain relations among the item responses. Specifically,
the conditional covariances between all monotone increasing functions of a set
of item responses, given any function of the remaining item responses, will be
non-negative. This theorem can be used to develop statistical tests of whether
an observed data set is consistent with the assumptions of monotonicity,
unidimensionality, and conditional independence. (See Holland, 1981; Holland and
Rosenbaum, in press; and Stout, 1984, for further discussion of tests of this
kind.)

As a special case of Rosenbaum's theorem, we can test the partial
association for each pair of items, given number-right score on the remaining
items, using the Mantel-Haenszel (1959) test, a conventional procedure for
analysis of discrete data. In this case, we are examining the conditional
covariance between monotone item summaries which are simply responses to a single
item. The function on which we are conditioning is the number-right score on the
remaining n - 2 items. To perform the Mantel-Haenszel test for a particular
item pair, a 2 x 2 table of item responses is constructed for each of the K
possible values of number-right score on the remaining items. Let $n_{ijk}$ be
the observed count in the ith row, jth column, and kth table, where
i = 1, 0; j = 1, 0; and k = 1, 2, ..., K. The Mantel-Haenszel test statistic
is given by

$$Z = \frac{n_{11+} - E(n_{11+}) + 1/2}{\sqrt{V(n_{11+})}} \qquad [14]$$

where $E(n_{11+})$ and $V(n_{11+})$ denote the hypergeometric expectation and variance
of $n_{11+}$, given by

$$E(n_{11+}) = \sum_{k=1}^{K} \frac{n_{1+k}\, n_{+1k}}{n_{++k}} \qquad [15]$$

$$V(n_{11+}) = \sum_{k=1}^{K} \frac{n_{1+k}\, n_{0+k}\, n_{+1k}\, n_{+0k}}{n_{++k}^{2}\,(n_{++k} - 1)} \qquad [16]$$

and the plus subscript indicates summation over that subscript. The
approximate significance level is obtained by referring Z to the lower tail of
the standard normal distribution. A statistically significant result
indicates that the pair of items has a negative partial association and is
thus inconsistent with the hypothesized model.
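A direct transcription of the statistic (numerator, hypergeometric mean and variance summed over the K tables, continuity correction of +1/2, lower-tail reference) is sketched below. The input layout, the function name, and the example tables are assumptions made for illustration; tables with fewer than two respondents are skipped because they contribute no variance.

```python
import math

def mantel_haenszel_z(tables):
    """Lower-tail Mantel-Haenszel statistic for a set of 2x2 tables.

    Each table is [[n11, n10], [n01, n00]]: rows are responses to the first
    item (1, 0) and columns are responses to the second item (1, 0).
    Returns (Z, lower-tail p-value). Assumes at least one table has a
    non-zero hypergeometric variance.
    """
    n11_sum = expected_sum = variance_sum = 0.0
    for (n11, n10), (n01, n00) in tables:
        total = n11 + n10 + n01 + n00
        if total < 2:
            continue
        row1, row0 = n11 + n10, n01 + n00
        col1, col0 = n11 + n01, n10 + n00
        n11_sum += n11
        expected_sum += row1 * col1 / total
        variance_sum += row1 * row0 * col1 * col0 / (total ** 2 * (total - 1))
    z = (n11_sum - expected_sum + 0.5) / math.sqrt(variance_sum)
    p_lower = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p_lower

# Hypothetical pair of strata with a positive partial association ...
z_pos, p_pos = mantel_haenszel_z([[[10, 5], [5, 10]], [[8, 4], [4, 8]]])
# ... and the same margins with a negative partial association.
z_neg, p_neg = mantel_haenszel_z([[[5, 10], [10, 5]], [[4, 8], [8, 4]]])
print(z_pos, p_pos)
print(z_neg, p_neg)
```

Only the second example would count against the hypothesized model, since the test is one-sided: small lower-tail probabilities signal negative partial association.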
The Mantel-Haenszel approach was programmed to accommodate the
complexities of BIB spiralling in the following way: Suppose that we are
interested in assessing the conditional covariance between items $X_1$ and
$X_2$ and that, because of BIB spiralling, certain students who received items
$X_1$ and $X_2$ also received $X_3$, $X_4$, and $X_5$, whereas others received $X_5$ and $X_6$.
The test of association between $X_1$ and $X_2$ is then based on seven 2 x 2 tables:
four corresponding to the possible score values for $X_3 + X_4 + X_5$ and three for
the possible scores for $X_5 + X_6$. Because of the spiralling method used to
assign booklets to respondents (see section 3.1.2), the fact that respondents
did not all receive the same items or even the same number of items does not
impair the validity of the method. Items that were omitted or were
administered but not reached (see section 3.1.2) were scored as incorrect.
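The stratification just described can be sketched as follows: each examinee who received both items of a pair is assigned to a 2 x 2 table keyed by the set of other items received and the number-right score on those items. The record layout (a dict of item name to 0/1 score), the function name, and the toy records are assumptions made for illustration, not the program actually used.

```python
from collections import defaultdict

def build_strata(records, item_a, item_b):
    """Group examinees into 2x2 tables for the pair (item_a, item_b).

    records: list of dicts mapping item name -> 0/1 score, containing only
    the items that examinee received. Each table is [[n11, n10], [n01, n00]].
    """
    strata = defaultdict(lambda: [[0, 0], [0, 0]])
    for rec in records:
        if item_a not in rec or item_b not in rec:
            continue                       # pair not administered together
        rest = tuple(sorted(k for k in rec if k not in (item_a, item_b)))
        rest_score = sum(rec[k] for k in rest)
        table = strata[(rest, rest_score)]
        table[1 - rec[item_a]][1 - rec[item_b]] += 1
    return dict(strata)

# Toy records covering every possible rest score in the two booklet types.
records = []
for score in range(4):                     # booklets with X3, X4, X5
    rec = {"X1": 1, "X2": 0, "X3": 0, "X4": 0, "X5": 0}
    for name in ["X3", "X4", "X5"][:score]:
        rec[name] = 1
    records.append(rec)
for score in range(3):                     # booklets with X5, X6
    rec = {"X1": 0, "X2": 1, "X5": 0, "X6": 0}
    for name in ["X5", "X6"][:score]:
        rec[name] = 1
    records.append(rec)

strata = build_strata(records, "X1", "X2")
print(len(strata))                         # 7 tables, as in the example above
```

The resulting collection of tables can be passed directly to a Mantel-Haenszel routine; strata from different booklet types are simply pooled in the sums.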

Because of the cost of computations, the Rosenbaum method was applied to
only a subset of the NAEP items: those in blocks H, K, M, N, and O. The
number of items per grage was 56 for grage 9/IV, 53 for grage 13/VIII, and 55
for grage 17/IX. The number of hypothesis tests, which is equal to the number
of item pairs, was 1540, 1378, and 1485 for grages 9/IV, 13/VIII, and 17/IX,
respectively. In order to evaluate the findings of this method, a decision
must be made about the appropriate alpha level at which to test these multiple
hypotheses. Whereas on one hand, we would like to control the overall Type I
error rate at an acceptable level, we do not want to maintain such rigorous
Type I error control that a rejection of the hypothesis of unidimensionality
would be impossible. As it turns out, even if the alpha for each hypothesis
test is set at .01, a liberal alpha level for so large a number of tests, the
number of statistically significant negative partial associations is only 4
for grage 9/IV, 4 for grage 13/VIII, and 6 for grage 17/IX. If alpha is set at
.05 for each test, the number of statistically significant results is 31, 29,
and 26 for the three grages, respectively (see Table 6). Therefore, it is
reasonable to retain the hypothesis that the item responses can be represented
by a monotonic unidimensional latent variable model with conditional
independence. It should be noted that application of the Rosenbaum method
does not provide a test of the fit of the three-parameter logistic model or of
any other specific model.

Table 6

Results of Rosenbaum Analyses

Within-Grage Analyses

                                            Grage
                                   9/IV    13/VIII    17/IX
Number of items                     56        53        55
Number of item pairs              1540      1378      1485
Number of significant
negative partial associations:
  alpha = .01 per comparison         4         4         6
  alpha = .05 per comparison        31        29        26

Across-Grage Analyses

                                          Grage pair
                                  9 & 13    9 & 17    13 & 17
Number of comparisons               24        24        24
Number of significant
negative partial associations:
  alpha = .05 per comparison         0         0         0
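One way to read the within-grage counts in Table 6 is to compare them with the number of rejections that chance alone would produce. The sketch below makes the simplifying assumption that every item pair sits exactly at the boundary of the null hypothesis (zero partial association), so that each test rejects with probability alpha; the function names are hypothetical, and the observed counts are those reported in Table 6.

```python
def n_item_pairs(n_items):
    """Number of distinct item pairs, n(n - 1)/2."""
    return n_items * (n_items - 1) // 2

def expected_chance_rejections(n_items, alpha):
    """Expected number of rejections if each test rejects with probability alpha."""
    return alpha * n_item_pairs(n_items)

n_items = {"9/IV": 56, "13/VIII": 53, "17/IX": 55}
observed = {0.01: {"9/IV": 4, "13/VIII": 4, "17/IX": 6},
            0.05: {"9/IV": 31, "13/VIII": 29, "17/IX": 26}}

for alpha, counts in observed.items():
    for grage, count in counts.items():
        expected = expected_chance_rejections(n_items[grage], alpha)
        print(grage, alpha, count, round(expected, 1))
```

In every grage and at both alpha levels the observed count falls below this benchmark, which is the sense in which the results give no reason to reject the unidimensional model.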

In applying the Rosenbaum method, no modifications were incorporated to
reflect NAEP's complex multistage cluster sampling scheme (Lago et al., 1985).
That is, raw rather than weighted frequencies were used in the analysis and no
jackknifing or design effect adjustment was used in computing the significance
probabilities of the Mantel-Haenszel statistics. As noted in section 3.2,
weighted and unweighted correlation matrices for the NAEP data are virtually
identical, suggesting that the weights would make little difference in the
Rosenbaum analyses. Furthermore, the design effect for these tests is likely
to be greater than one, as in 3.4. Adjustment of the significance tests would
then lead to a reduction in the number of item pairs found to have negative
partial associations, thus reinforcing the original conclusion about
dimensionality.

3.5.1 Across-grage analyses

In addition to determining whether it was reasonable to
regard the reading items as unidimensional within each grage, it was of
interest to investigate whether unidimensionality would hold if respondents
from all three grages were included. Of the entire set of items available for
dimensionality analyses (Table 1), 25 were administered to all three grages.
Twenty-four of these 25 were in the item blocks (H, K, M, N, O) used for the
Rosenbaum analyses. A method developed by Rosenbaum (1984b), which is a
variant of the approach described above, was applied to these 24 items. The
procedure provides a test of whether the item responses of two groups of
examinees are consistent with a difference in the distribution of a
unidimensional latent variable. A rejection of this hypothesis would mean
that it was necessary to postulate the existence of additional dimensions. As
a first step in the analysis, an indicator variable is created to represent
group membership, with the higher value associated with the group hypothesized
to have generally higher values on the latent variable. If the pattern of
item responses is consistent with the hypothesized model, the conditional
covariances of each item with the indicator variable will be non-negative, as
described in 3.5.

For the NAEP data, a separate analysis was conducted for each pair of
grages, as follows: An indicator variable representing grage was created,
with a value of 1 indicating the higher grage and the value of 0 corresponding
to the lower grage. The partial association of each of the 24 items with
grage was then assessed, using the Mantel-Haenszel (1959) test, as described
in 3.5. With an alpha of .05 for each of the 24 hypothesis tests per grage pair
(see Table 6), no significant negative partial associations of items with the
dummy-coded grage variable were found. This means that, as we would expect
intuitively, students in higher grages were more likely than students in lower
grages to answer items correctly, conditional on number-right score on the
remaining items. These results are consistent with unidimensionality of the
item pool.

4. Conclusions

Overall, the four dimensionality analyses of the NAEP reading items
indicate that it is not unreasonable to treat the data as unidimensional. As a
preliminary approach, principal component analyses of phi and tetrachoric
correlation matrices were computed for each of the three grages and for the 25
across-grage items. The first roots obtained from these analyses were sizeable,
ranging from 17 to 25 percent of the trace for the phi matrices and 30 to 40
percent for the tetrachoric matrices. (For simulated unidimensional data, the
first root of the phi matrix typically constituted 25 to 30 percent of the
trace.)

As an experimental method, a factor-analytic approach based on Guttman's
image theory was also applied. Principal component analysis of the image
correlation matrices yielded larger first roots than PCA of the corresponding
phi matrices, but larger second roots as well. Application of image analysis to
simulated unidimensional data showed that principal component loadings had a
substantial correlation with the proportions correct for the items. Thus,
the image approach does not avoid the artifacts associated with the application
of linear factor-analytic methods to dichotomous data.

Application of Bock's full-information factor analysis to a subset of the
grage 13/VIII data led to a satisfactory fit with a one-factor model. The first
factor accounted for 29 percent of the total variance. Reading comprehension
items involving fictional stories had the highest loadings on this factor; life
skills items had the lowest.

Finally, the Mantel-Haenszel approach developed by Rosenbaum led to a
retention of the hypothesis that the data can be represented by a unidimensional
latent variable model with conditional independence. In addition to analyses
within each grage, tests were conducted to determine whether data for each pair
of grages were consistent with a difference in distribution of a unidimensional
latent variable. Again, the hypothesis of unidimensionality was retained.

Although categorization of the NAEP reading items is useful for test
development and reading research, the dimensionality analyses reported here do
not provide strong empirical evidence for the existence of multiple dimensions.
Especially when considered in light of the robustness research discussed in
section 1.1, the results do not contraindicate the application of unidimensional
item response theory models to the reading data.






References 

Beaton, A. E. (1984). Statistical issues in data analysis for the National
Assessment of Educational Progress. Paper presented at the annual meeting of
the American Statistical Association, Philadelphia, August 1984.

Bejar, I. I. (1980). A procedure for investigating the unidimensionality
of achievement tests based on item parameter estimates. Journal of
Educational Measurement, 17, 283-296.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of
mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of
item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., Gibbons, R. T., & Muraki, E. (1985). Full-information item
factor analysis (MRC Report No. 85-1). Chicago: National Opinion Research
Center.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n
dichotomously scored items. Psychometrika, 35, 179-197.

Carroll, J. B. (1945). The effect of difficulty and chance success on
correlations between items and between tests. Psychometrika, 26, 347-372.



Carroll, J. B. (1983). The difficulty of a test and its factor composition
revisited. In H. Wainer & S. Messick (Eds.), Principals of modern
psychological measurement. Hillsdale, NJ: Erlbaum.

Christoffersson, A. (1975). Factor analysis of dichotomized variables.
Psychometrika, 40, 5-32.

Cook, L. L., Dorans, N. J., Eignor, D. R., & Petersen, N. S. (1985).
An assessment of the relationship between the assumption of
unidimensionality and the quality of IRT true-score equating.
(ETS Research Report 85-30.) Princeton, NJ: Educational Testing
Service.

Cook, L. L., & Eignor, D. R. (1984). Assessing the dimensionality of
NAEP reading test items: Confirmatory factor analysis of item parcel
data. Paper presented at the annual meeting of the American
Educational Research Association, New Orleans, April 1984.

Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional
item response theory models to multidimensional data. Applied
Psychological Measurement, 7, 189-199.

Guttman, L. (1953). Image theory for the structure of quantitative
variates. Psychometrika, 18, 277-296.

Haberman, J. S. (1977). Log-linear models and frequency tables with small
expected cell counts. Annals of Statistics, 5, 1148-1169.

Hambleton, R. K., & Rovinelli, R. J. (in press). Assessing the
dimensionality of a set of test items. Applied Psychological
Measurement.



Harris, C. W. (1962). Some Rao-Guttman relationships. Psychometrika, 27,
247-263.

Hattie, J. (1984). An empirical study of various indices for determining
unidimensionality. Multivariate Behavioral Research, 19, 49-78.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of
tests and items. Applied Psychological Measurement, 9, 139-164.

Hendrickson, A. E., & White, P. O. (1964). PROMAX: A quick method for
rotation to oblique simple structure. British Journal of Statistical
Psychology, 17, 65-70.

Holland, P. W. (1981). When are item response models consistent with
observed data? Psychometrika, 46, 79-92.

Holland, P. W., & Rosenbaum, P. R. (in press). Conditional association
and unidimensionality in monotone latent variable models.
Annals of Statistics.

Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory:
Application to psychological measurement. Homewood, IL: Dow Jones-Irwin.

Johnson, E. G. (1980). Analysis of NAEP data. Technical report, Education
Commission of the States, Denver, Colorado.

Jungeblut, A. (1984). Assessing the dimensionality of NAEP reading
test items: Linear factor analysis models. Paper presented at the
annual meeting of the American Educational Research Association, New
Orleans, April 1984.



Kaiser, H. F. (1963). Image analysis. In C. W. Harris (Ed.),
Problems in measuring change, pp. 156-166. Madison, WI: University
of Wisconsin Press.

Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika,
35, 401-415.

Kaiser, H. F., & Cerny, B. A. (1978). Pseudo-images and pseudo-anti-images
from the pseudo-inverse of a singular correlation matrix.
British Journal of Statistical Psychology, 31, 99-101.

Kaiser, H. F., & Cerny, B. A. (1979). Factor analysis of the image
correlation matrix. Educational and Psychological Measurement, 39,
711-714.

Kingston, N. M., & Dorans, N. J. (1981). The feasibility of using item
response theory as a psychometric model for the GRE Aptitude Test.
(GRE Board Professional Report 79-12.) Princeton, NJ: Educational
Testing Service.

Lago, J. A., Burke, J. S., Tepping, B. J., & Hansen, M. H. (1985).
Report on sample selection, weighting, and variance estimation:
NAEP-Year 15. Rockville, MD: Westat.

Lord, F. M. (1980). Applications of item response theory to practical
testing problems. Hillsdale, NJ: Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental
test scores. Reading, MA: Addison-Wesley.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the retrospective
study of disease. Journal of the National Cancer Institute, 22, 719-748.



McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric
Monographs (No. 15).

McDonald, R. P. (1983). Exploratory and confirmatory nonlinear common
factor analysis. In H. Wainer & S. Messick (Eds.), Principals of
modern psychological measurement. Hillsdale, NJ: Erlbaum.

McDonald, R. P., & Ahlawat, S. (1974). Difficulty factors in binary
data. British Journal of Mathematical and Statistical Psychology,
27, 82-99.

Messick, S., Beaton, A., & Lord, F. (1983). NAEP reconsidered: A new
design for a new era. (NAEP Report 83-1.) Princeton, NJ:
Educational Testing Service.

Mislevy, R. J. (in press). Recent developments in the factor analysis
of categorical variables. Journal of Educational Statistics.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Item analysis and test
scoring with binary logistic models [Computer program].
Mooresville, IN: Scientific Software.

Mosenthal, P. B. (1985). An analysis of NAEP reading assessment
items. Unpublished manuscript, Syracuse University.

Mulaik, S. A. (1972). The foundations of factor analysis. New York:
McGraw-Hill.

Muthen, B. (1978). Contributions to factor analysis of dichotomous
variables. Psychometrika, 43, 551-560.



Reckase, M. D. (1979). Unifactor latent trait models applied to
multifactor tests: Results and implications. Journal of
Educational Statistics, 4, 207-230.

Rosenbaum, P. R. (1984). Testing the conditional independence and
monotonicity assumptions of item response theory.
Psychometrika, 49, 425-435. (a)

Rosenbaum, P. R. (1984). Are the item responses of two groups of
examinees consistent with a difference in the distribution of a
unidimensional latent variable? (Program Statistics Research
Technical Report No. 84-51). Princeton, NJ: Educational Testing
Service. (b)

Stout, W. F. (1984). The statistical assessment of latent trait
dimensionality in psychological testing. (ONR Report). Urbana-Champaign,
IL: Department of Mathematics, University of Illinois.

Traub, R. E., & Wolfe, R. G. (1981). Latent trait theories and the
assessment of educational achievement. Review of Research
in Education, 9, 377-435.

Wingersky, B. (1984). Gramianizing matrices. Unpublished memorandum.

Wingersky, M. S. (1983). LOGIST: A program for computing maximum
likelihood procedures for logistic test models. In R. Hambleton
(Ed.), Applications of item response theory. Vancouver, BC:
Educational Research Institute of British Columbia.

Wilson, D., Wood, R. L., & Gibbons, R. (1983). TESTFACT: Test
scoring and item factor analysis [Computer program]. Chicago:
Scientific Software.



Appendix 1 

A Procedure for Obtaining a Gramian Matrix that Approximates a 
BIB Correlation Matrix for NAEP Items 

1. Start with the weighted (i.e., incorporating sampling weights)
BIB covariance matrix.

2. Substitute zeroes for the negative eigenvalues. (The negative
eigenvalues constituted 4, 2, and 2 percent of the trace of the missing data
covariance matrix for grages 9/IV, 13/VIII, and 17/XI, respectively. There
were no negative eigenvalues for the across-grage matrix.)

3. Now obtain the "reconstructed" covariance matrix, C*, using the
following equation:

$$C^{*} = Q D^{*} Q',$$

where Q is the matrix of normalized eigenvectors of the original covariance
matrix and D* is a diagonal matrix of eigenvalues, with zeroes substituted for
the negative eigenvalues. The matrix

$$C^{*-} = Q D^{*-} Q'$$

is the pseudo-inverse of C*, where the elements of D*- are the reciprocals of
the corresponding elements of D* for positive elements of D* and zeroes for
zero elements of D*.

4. It is now possible to obtain a reconstructed correlation matrix, R*,
corresponding to C*, using ordinary methods. The pseudo-inverse of R* can be
obtained as follows:

$$R^{*-} = S\,C^{*-}\,S,$$

where S is a diagonal matrix of the square roots of the diagonal elements of
C*.
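
The four steps above can be sketched in a few lines of NumPy. This is a
minimal illustration under stated assumptions, not the program used for the
NAEP analyses: the function name `gramianize`, the tolerance used to decide
which eigenvalues count as positive, and the small indefinite example matrix
are all introduced here for illustration, and the sketch assumes every
diagonal element of the reconstructed covariance matrix is positive.

```python
import numpy as np


def gramianize(cov):
    """Zero out negative eigenvalues of a symmetric matrix (Steps 1-4).

    Returns the reconstructed covariance matrix, its pseudo-inverse, the
    reconstructed correlation matrix, and the rescaled inverse S C*- S.
    """
    cov = np.asarray(cov, dtype=float)
    # Eigendecomposition of the symmetric input: cov = Q diag(d) Q'
    d, Q = np.linalg.eigh(cov)
    # Step 2: substitute zeroes for negative eigenvalues. Eigenvalues below
    # a small rounding tolerance are also treated as zero (an assumption).
    tol = max(d.max(), 0.0) * len(d) * np.finfo(float).eps
    positive = d > tol
    d_star = np.where(positive, d, 0.0)
    # Step 3: reconstructed covariance matrix and its pseudo-inverse
    C_star = Q @ np.diag(d_star) @ Q.T
    d_inv = np.zeros_like(d_star)
    d_inv[positive] = 1.0 / d_star[positive]
    C_ginv = Q @ np.diag(d_inv) @ Q.T
    # Step 4: rescale by the square roots of the diagonal of C*
    s = np.sqrt(np.diag(C_star))
    R_star = C_star / np.outer(s, s)      # S^-1 C* S^-1
    R_ginv = np.outer(s, s) * C_ginv      # S C*- S
    return C_star, C_ginv, R_star, R_ginv


# Hypothetical 3 x 3 "covariance" matrix of the kind pairwise estimation can
# produce: symmetric, but with one negative eigenvalue.
C = np.array([
    [1.00, 1.80, -0.45],
    [1.80, 4.00, 0.90],
    [-0.45, 0.90, 0.25],
])

C_star, C_ginv, R_star, R_ginv = gramianize(C)
print("eigenvalues of C :", np.linalg.eigvalsh(C))
print("eigenvalues of C*:", np.linalg.eigvalsh(C_star))
print("diagonal of R*   :", np.diag(R_star))
```

The reconstructed matrices have no negative eigenvalues, and R* has ones on
its diagonal because the rescaling is applied after the eigenvalues are
modified.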

It is desirable to begin with the covariance matrix in Step 1 because
operating on the correlation matrix, R, directly will produce a reconstructed
R that does not have ones on the diagonal.
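
The point about the diagonal can be checked numerically. The sketch below
zeroes the negative eigenvalue of a small, hypothetical indefinite
"correlation" matrix directly; the matrix and the helper name are assumptions
made for illustration. Zeroing negative eigenvalues adds a positive
semidefinite matrix to the original, so diagonal elements can only stay the
same or grow.

```python
import numpy as np


def clip_negative_eigenvalues(m):
    """Rebuild a symmetric matrix after setting negative eigenvalues to zero."""
    d, Q = np.linalg.eigh(m)
    return Q @ np.diag(np.clip(d, 0.0, None)) @ Q.T


# Hypothetical indefinite matrix with unit diagonal (eigenvalues 1.9, 1.9, -0.8)
R = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])

R_clipped = clip_negative_eigenvalues(R)
print("diagonal after clipping R directly:", np.diag(R_clipped))  # exceeds 1

# Rescaling by the square roots of the new diagonal restores ones, which is
# what converting a reconstructed covariance matrix to correlations does.
s = np.sqrt(np.diag(R_clipped))
R_rescaled = R_clipped / np.outer(s, s)
print("diagonal after rescaling          :", np.diag(R_rescaled))
```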

The medians of the residuals obtained by subtracting elements of R* from
elements of the original R were .007, .002, and .003 for grages 9/IV, 13/VIII,
and 17/XI, respectively. In addition, the eigenstructures for the R* matrices
were very similar to those for the original R's. The method is inexpensive
and is not difficult to program. An alternative method of B. Wingersky
(1984) produced smaller residuals, but was prohibitively expensive to execute.